ABSTRACT

Pythoscape is a framework implemented in Python for processing 
large protein similarity networks for visualization in other software 
packages. Protein similarity networks are graphical representations 
of sequence, structural and other similarities among proteins for 
which pairwise all-by-all similarity connections have been calculated. 
Mapping of biological and other information to network nodes or
edges enables hypothesis creation about sequence-structure-function
relationships across sets of related proteins. Pythoscape provides
several options to calculate pairwise similarities for input sequences or
structures, applies filters to network edges, defines sets of similar
nodes and their associated data as single nodes (termed representative
nodes) to compress network information, and outputs data or
formatted files for visualization.
Contact: babbitt@cgl.ucsf.edu 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
1 INTRODUCTION

The rapid growth of databases of protein information (e.g. sequences
and structures) provides both new opportunities and
challenges for analysis and clustering by similarity. For example,
global analysis of entire superfamilies and association of their
members with biological information and other types of metadata
has become a useful tool for functional annotation and
discovery (Brown and Babbitt, 2012). As these sets become
larger (sometimes many thousands of sequences) and their members
more divergent, their fast exploration on a large scale
becomes less feasible using traditional approaches such as alignments
and trees.

Protein similarity networks (PSNs) enable analysis and visualization
of structure-function relationships in large protein data
sets by clustering of individual protein sets for more complex
analysis while summarizing 'connectivity' relationships among
the clusters. Mapping orthogonal sources of biological information
onto PSNs then provides a powerful way to view functional
trends across the set that can be interpreted in the context of their
similarities. (See Atkinson et al., 2009 for an initial analysis of
some uses and statistical validation of PSNs.)

While databases like Similarity Matrix of Proteins (SIMAP)
(Rattei et al., 2010) store pairwise similarities, and plug-ins
available with software such as Cytoscape (Smoot et al., 2011)
allow creation of small PSNs (Wittkop et al., 2010), no software
solution exists to create and manage large PSNs. And while
PSNs are inherently amenable to association with orthogonal
information sources, the many information types available complicate
development of a single software solution for managing
such diverse features. Pythoscape addresses these issues and provides
a software framework to create PSNs and develop new
analyses for inference of functional properties in proteins.

2 DESCRIPTION AND SIGNIFICANCE 

Pythoscape is an extensible computational framework implemented
in Python to generate and analyze PSNs. For the user
interested in generating large networks, the Pythoscape package
has a core set of plug-ins (Supplementary Table S1) and tutorials,
so that no development is needed to create simple networks
painted with useful metadata. For software developers, Pythoscape
provides a framework for rapid modification along with
well-documented application programming interfaces for development
of additional plug-ins using new sources of metadata.

Unlike sparser networks such as interaction networks, PSNs
are frequently close to complete, often requiring storage and
management of large quantities of data, and fast calculation
(Supplementary Table S2). Pythoscape allows for flexible storage
of data through the use of storage interfaces. Appropriate storage
solutions can be chosen based on network size or developed
as needed, which makes it easy to adopt faster and more reliable
database software. Pythoscape can create, store and
manage large networks, then, using representative nodes and
edges to compress the information, output smaller summary networks
for visualization (Fig. 1A and B). Users can choose how
distances between representative nodes are calculated and,
importantly, the full set of sequences in each node is retained for
later use.
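The compression step described above can be illustrated with a minimal Python sketch. It is not the Pythoscape API: the function names, the dictionary-based data layout and the toy distances are all assumptions made for illustration, and the mean is used as just one of the possible ways to calculate distances between representative nodes. The edge-count helper shows why near-complete networks grow so quickly, using the 7447 sequences and 664 representative nodes reported for the GST network in Fig. 1.

```python
from collections import defaultdict


def complete_edge_count(n_nodes):
    """Number of undirected edges in an all-by-all network of n_nodes."""
    return n_nodes * (n_nodes - 1) // 2


def compress_network(edges, node_to_rep):
    """Collapse a sequence-level network into representative nodes.

    edges       -- dict mapping (seq_a, seq_b) to a pairwise distance
    node_to_rep -- dict mapping each sequence ID to its representative node ID
    Returns (members, rep_edges): the sequences retained in each
    representative node, and the mean distance between each pair of
    representative nodes (hypothetical choice; other summaries are possible).
    """
    members = defaultdict(set)
    for seq, rep in node_to_rep.items():
        members[rep].add(seq)

    pooled = defaultdict(list)
    for (a, b), dist in edges.items():
        rep_a, rep_b = node_to_rep[a], node_to_rep[b]
        if rep_a != rep_b:  # edges inside one representative node are absorbed
            pooled[tuple(sorted((rep_a, rep_b)))].append(dist)

    rep_edges = {pair: sum(d) / len(d) for pair, d in pooled.items()}
    return dict(members), rep_edges


# Toy data (invented): four sequences grouped into two representative nodes.
edges = {("s1", "s2"): 1.0, ("s1", "s3"): 4.0, ("s1", "s4"): 6.0,
         ("s2", "s3"): 5.0, ("s2", "s4"): 7.0, ("s3", "s4"): 2.0}
node_to_rep = {"s1": "R1", "s2": "R1", "s3": "R2", "s4": "R2"}

members, rep_edges = compress_network(edges, node_to_rep)
print(rep_edges)  # {('R1', 'R2'): 5.5}

# Pairwise comparisons in a complete network of sequences vs. representative nodes.
print(complete_edge_count(7447))  # 27725181
print(complete_edge_count(664))   # 220116
```

In this sketch the member sets are kept alongside the compressed edges, mirroring the point that the full set of sequences in each representative node remains available after compression.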

Additionally, Pythoscape has plug-ins for creating structure
similarity networks and for generating correlations for edge
distances between networks generated from a set of sequences
and a corresponding set of available structures (Supplementary
Table S1 and Supplementary Figs. S2 and S3).
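As an illustration of correlating edge distances between two networks built on the same proteins, the following sketch computes a Pearson correlation over the edges present in both. The choice of Pearson correlation, the function name and the toy distances are assumptions made here for illustration; the source does not specify which statistic the Pythoscape plug-in uses.

```python
import math


def shared_edge_correlation(edges_a, edges_b):
    """Pearson correlation of distances over edges present in both networks.

    Each argument maps frozenset({node_1, node_2}) to a distance, so the
    order of the two nodes in an edge does not matter.
    """
    shared = [edge for edge in edges_a if edge in edges_b]
    if len(shared) < 2:
        raise ValueError("need at least two shared edges")

    xs = [edges_a[edge] for edge in shared]
    ys = [edges_b[edge] for edge in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)


# Toy data (invented): the sequence network has one edge with no structural
# counterpart, which is ignored when the correlation is computed.
sequence_net = {frozenset(("p1", "p2")): 1.0, frozenset(("p1", "p3")): 2.0,
                frozenset(("p2", "p3")): 3.0, frozenset(("p1", "p4")): 9.0}
structure_net = {frozenset(("p1", "p2")): 2.0, frozenset(("p1", "p3")): 4.0,
                 frozenset(("p2", "p3")): 6.0}

r = shared_edge_correlation(sequence_net, structure_net)
print(round(r, 6))  # 1.0
```

Restricting the comparison to shared edges reflects the fact that structures are typically available for only a subset of the sequences in a network.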

3 EXAMPLE USAGE 

Glutathione transferases (GSTs) are enzymes that typically catalyze
the addition of glutathione to substrate compounds. They
play roles in many biological processes, including metabolism of
endogenous compounds and xenobiotics such as drugs. Of the



© The Author 2012. Published by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.



A.E.Barber and P.C.Babbitt 




[Fig. 1 legend. SwissProt GST family: Alpha; Beta; HSP26; HSP26 & Phi; HSP26, Phi & Tau; Mu; Omega; Phi; Pi; Sigma; Tau; Theta; Zeta; Other/None. Number of represented sequences (A only): 100-300; 10-99; 1-9. Marker: PDB structure.]

Fig. 1. Sequence similarity network of the GST superfamily generated by
Pythoscape and visualized in Cytoscape. To compact the view for this
figure, networks were laid out using the organic layout in Cytoscape
rather than the distances computed from a similarity metric. In all, 664
representative nodes are used to describe pairwise relationships among
7447 sequences. (A) Representative network with functional classes colored,
if annotated by SwissProt in a family (The UniProt Consortium,
2011). Family membership is indicated if one or more sequences in the
abstracted node are associated with that family. (B) Full non-abstracted
network for the group of GSTs found mostly in eukaryotes (boxed in A).



thousands of GSTs that have been identified, the physiological 
substrates of only a small proportion are known; thus, they are 
principally classified into putative functional classes according to 
enzymatic, structural, and other features (Mannervik and 
Danielson, 1988). Recently, PSNs have been used to summarize 
and guide a global interpretation of GST sequence and structure 
relationships (Atkinson and Babbitt, 2009). 

A PSN of GST sequences is shown in Figure 1A (see Supplementary
information for network creation and graph statistics).
It illustrates how representative nodes computed by Pythoscape
enable analysis of PSNs too large to be visualized in full while
retaining their value for developing hypotheses from sequence
similarities across the whole set. For comparison, individual clusters
of interest can be output with all nodes present (Fig. 1B).
This full non-abstracted network (with one node for each
sequence) shows a pattern of relationships similar to that shown
in the corresponding representative node network (boxed in
Fig. 1A). The correlation between the ideal representative node
mean distances calculated in Pythoscape and the corresponding
full network ideal distance for Fig. 1A is provided in Supplementary
Figure S1. A quantitative description of the relationships
between filtered networks and full networks has also recently
been reported elsewhere for some example systems (Atkinson
et al., 2009), but these differences appear also to depend on the
specific system analyzed. While 'missing data' is an inherent feature
of representative nodes, the trade-off is the ability to visualize
similarity relationships across large datasets, which would otherwise
not be practically achievable because of memory and speed limitations
in their calculation.

The network shown in Figure 1A demonstrates another issue
in the use of representative nodes that could complicate interpreting
relationships between functional features and sequence
similarity. In the example given here, some GST families are
represented by multiple representative nodes, whereas other
representative nodes contain multiple SwissProt families (HSP26,
Phi and Tau), obscuring how sequence similarity tracks with
annotation. Thus, we recommend that analysis using representative
networks be accompanied by examination of the relevant
parts of the corresponding full networks.
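One way to locate the parts of a representative network that merit this closer examination is to tabulate how family annotations map onto representative nodes. The sketch below is a hypothetical helper, not part of Pythoscape; the data layout, names and toy annotations are assumptions. It reports representative nodes that contain more than one family and families that are spread over more than one representative node.

```python
from collections import defaultdict


def annotation_overlap(rep_members, seq_family):
    """Summarize how family annotations map onto representative nodes.

    rep_members -- dict: representative node ID -> set of sequence IDs
    seq_family  -- dict: sequence ID -> family name (unannotated IDs absent)
    Returns (mixed_nodes, split_families):
      mixed_nodes    -- nodes whose members carry more than one family
      split_families -- families found in more than one representative node
    """
    node_families = {}
    family_nodes = defaultdict(set)
    for rep, seqs in rep_members.items():
        fams = {seq_family[s] for s in seqs if s in seq_family}
        node_families[rep] = fams
        for fam in fams:
            family_nodes[fam].add(rep)

    mixed_nodes = {rep: fams for rep, fams in node_families.items()
                   if len(fams) > 1}
    split_families = {fam: reps for fam, reps in family_nodes.items()
                      if len(reps) > 1}
    return mixed_nodes, split_families


# Toy data (invented): sequence "e" has no family annotation.
rep_members = {"R1": {"a", "b"}, "R2": {"c"}, "R3": {"d", "e"}}
seq_family = {"a": "Phi", "b": "Tau", "c": "Phi", "d": "Mu"}

mixed_nodes, split_families = annotation_overlap(rep_members, seq_family)
print(sorted(mixed_nodes))     # ['R1']
print(sorted(split_families))  # ['Phi']
```

Nodes or families flagged in this way would be natural candidates for output as full non-abstracted networks.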



4 CONCLUSION 

Pythoscape is a software framework to efficiently create and
manage protein similarity networks. Tutorials, Pythoscape documentation,
source code and future development plans are available
at http://www.rbvi.ucsf.edu/trac/Pythoscape.